ALIEN-59: K8s agent self-update — chart guardrails, /v1/rejoin recovery, persistence default by yonatan-alien · Pull Request #69 · alienplatform/alien

yonatan-alien · 2026-06-03T12:59:33Z

Summary

Closes ALIEN-59. Manager-driven agent self-update on Kubernetes, plus the two production-hardening fixes that came out of live e2e.

Feature

New agent_upgrade loop in alien-agent consumes agent_target.helm on /v1/sync and spawns a Helm-runner Job that runs helm upgrade --reuse-values --set runtime.image.tag=<target> under an alien-agent-upgrader ServiceAccount.
alien-helm chart guardrails — required management.{url,name,token,deploymentId}, /livez+/readyz probes, Recreate strategy, namespace-scoped upgrader Role/Binding, upgrader.enabled opt-out.
Sync schema gains agent_version / agent_os / agent_arch / regime / agent_image_repository; K8s worker transport set to local; K8s ServiceAccount controller is skipped (chart owns it).
alien-manager exposes PUT /v1/deployments/:id/target-agent-version; build_agent_target emits agent_target.helm when target ≠ reported and regime = kubernetes.

Production hardening (would silently break otherwise; both found via live e2e)

fix(helm): enable agent runtime-data persistence by default — without a PVC, the self-update's pod roll wipes the agent's deployment_id/sync token, the new pod hits 409 from initialize, CrashLoopBackOffs, and helm --atomic rolls back. Documented opt-out.
feat(agent,manager): /v1/rejoin endpoint — for the case where persistence is genuinely off (or the state directory is wiped for any other reason), the agent's initialize-409 falls through to a new POST /v1/rejoin that re-acquires a deployment-scoped sync token over the existing row. New RouterOptions::include_rejoin + skip_rejoin() builder lets multi-tenant embedders override, same pattern as /v1/initialize.
fix(agent): preserve 409 from initialize so rejoin fall-through fires — the agent's .context(ConfigurationError) was clobbering the SDK's real 409 with a hardcoded 500. New DeploymentNameAlreadyExists variant (409) lets the conflict surface so the rejoin branch actually triggers.

Paired platform-side PR: https://github.com/alienplatform/platform/pull/new/yonatan/alien-59-self-update-for-the-alien-agent-across-all-pull-mode-2

Test plan

cargo test -p alien-agent -p alien-core -p alien-deployment -p alien-helm -p alien-manager --lib
Live e2e against local managerx + orbstack K8s: pin a target version in the dashboard → manager emits agent_target → helm-upgrade Job rolls the agent pod → PVC keeps deployment_id → new agent reports the new version
Live e2e for rejoin: kubectl exec rm -rf /var/lib/alien-agent/* + bounce pod → agent logs Name already exists — assuming local state was wiped, rejoining… → recovers the original deployment_id and stays running
Reviewer to confirm: client-sdks/platform/rust/openapi*.json kept ours during the main merge to preserve the rejoin/target-agent-version schemas; main's added ManagerDomainBinding* + PackageDomain schemas in those files were dropped (they were generated from a private platform branch). Next platform-side openapi regen will bring them back.

…IEN-59) Squashed replay of the K8s self-update feature onto current main: - `alien-agent`: new `agent_upgrade` loop that consumes `agent_target.helm` from /v1/sync and spawns a Helm-upgrade Job using the upgrader SA; sync loop persists deployment-scoped token across restarts. - `alien-helm`: chart guardrails — `required management.{url,name,token, deploymentId}`, /livez + /readyz probes, Recreate strategy, upgrader SA + namespace-scoped Role, `upgrader.enabled` values block. - `alien-core`: sync schema gains `agent_version`, `agent_os`, `agent_arch`, `regime`, `agent_image_repository`; K8s worker transport set to `local`. - `alien-manager`: PUT /v1/deployments/:id/target-agent-version; sync emits `agent_target` when `target_agent_version` differs from the agent's reported version; sqlite store + migrations + ReconcileData carry the new inventory; sync-token persistence reuses across restarts. - `alien-infra`: K8s ServiceAccount controller is skipped — the chart owns it, controllers reference it by name. - `alien-deployment`: Kubernetes platform routes through local env-info collection (k8s-on-cloud picks the base cloud via existing context).

Copy the regenerated `apps/api/openapi.json` from the platform repo so the Rust SDK (`alien-platform-api`, generated via `progenitor` at build time) picks up the new `PUT /v1/deployments/:id/target-agent-version` endpoint, the extended SyncReconcileRequest fields, and the `basePlatform` property on `PersistImportedDeploymentRequest`.

Without a PVC backing `/var/lib/alien-agent`, the agent's `deployment_id` + deployment-scoped sync token live in an `emptyDir` and are wiped on any pod restart. The self-update flow (manager emits `agent_target.helm` → agent spawns helm-upgrade Job → Job rolls the agent Deployment) triggers exactly such a restart, and the new pod re-runs `/v1/initialize`, hits a DEPLOYMENT_NAME_ALREADY_EXISTS 409, CrashLoopBackOffs, helm `--atomic` times out, and rolls back. The feature is silently unusable. Default `runtime.data.persistence.enabled` to `true` so self-update survives the pod roll on any cluster with a default StorageClass. Operators on clusters without one can still set `storageClassName`, point at an `existingClaim`, or explicitly opt out and accept that self-update won't survive a pod roll. Helm snapshot tests rebaselined.

The agent's `data_dir` (`/var/lib/alien-agent`) holds its persistent `deployment_id` + deployment-scoped sync token. When the chart leaves `runtime.data.persistence` off (or anything else wipes that directory — pod eviction, manual reset, etc.) the agent restarts with no stored state and falls into the "first startup" branch, hits `/v1/initialize`, and the platform rejects with `DEPLOYMENT_NAME_ALREADY_EXISTS 409` because the row is still there. Result: CrashLoopBackOff, and if the restart was driven by self-update's helm-upgrade Job, helm `--atomic` times out and rolls back. Add `POST /v1/rejoin` so the agent can explicitly re-attach to an existing deployment by name. Same dg bearer the chart mounts, no body beyond the name; response is `{ deploymentId, token }` with a freshly minted deployment-scoped token. The agent code-path catches the 409 from initialize and falls through to rejoin instead of erroring. - `alien-manager`: new `RejoinRequest`/`RejoinResponse`, `rejoin` handler that mirrors the existing initialize idempotency branch (look up by `(dg, name)`, mint a fresh token), wired through a new `rejoin_router()` so multi-tenant embedders can override it the same way they override initialize. `RouterOptions::include_rejoin` + a `skip_rejoin()` builder method follow the existing pattern. - `alien-agent`: on 409 from initialize, resolve the deployment name the same way initialize did (CLI arg / env / hostname) and call the new `rejoin_with_manager` helper. The recovered `(deployment_id, token)` are written back to `data_dir` like a normal init, so subsequent restarts go through the stored-state branch. - `alien-manager-api` SDK regenerated; OSS `/v1/rejoin` is exposed in the openapi schema for downstream tooling.

The agent's initialize → rejoin fall-through keyed off `http_status_code == Some(409)`, but `initialize_with_manager` wrapped every SDK error in `ConfigurationError` whose hardcoded `http_status_code = 500` clobbered the real status. Result: agent received "Unexpected response: 409 Conflict" from the manager, hit the `.context(ConfigurationError)` wrap, the outer status became 500, and the agent crashed instead of falling through to `/v1/rejoin`. - New `DeploymentNameAlreadyExists` variant on `ErrorData` (409, non-retryable, non-internal) so the 409 path has a distinct outer error. - `initialize_with_manager` rebuilt: 409 from the SDK is re-emitted as `DeploymentNameAlreadyExists`; everything else still goes through `ConfigurationError` so existing error paths are unchanged. Confirmed live: after wiping `/var/lib/alien-agent/*` and bouncing the pod, the agent now logs `Name already exists — assuming local state was wiped, rejoining…`, calls `/v1/rejoin`, recovers the same deployment_id, and stays running.

…-update-for-the-alien-agent-across-all-pull-mode-2 # Conflicts: # client-sdks/platform/openapi.json # client-sdks/platform/rust/openapi-3.0.json # client-sdks/platform/rust/openapi.json # crates/alien-helm/src/generator.rs # crates/alien-helm/tests/generator/boot_paths_tests.rs # crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__data_layer.snap # crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__manager_only_pure_worker.snap

greptile-apps · 2026-06-03T13:07:59Z

Greptile Summary

This PR lands manager-driven K8s agent self-update (Helm-runner Job spawned from the agent on /v1/sync), the /v1/rejoin recovery flow for wiped persistent state, and the chart guardrails (Recreate strategy, PVC-on-by-default, required value guards, /livez+/readyz probes) that make the upgrade cycle safe.

K8s self-update loop (agent_upgrade.rs): agent reads agent_target.helm from the sync response, resolves chart ref + runner image from env vars injected by the chart, and POSTs a one-shot Helm-runner Job to the in-cluster API using the alien-agent-upgrader SA.
/v1/rejoin endpoint (manager + agent cli.rs): catches the 409 from /v1/initialize (name already exists after a state wipe), re-acquires a fresh deployment-scoped sync token over the existing row, and persists it to the local SQLite state so pod restarts don't fall back to the chart-mounted DG token.
Chart hardening (generator.rs): PVC enabled by default, required template guards on management.*, Recreate rollout strategy, namespace-scoped upgrader Role/ClusterRoleBinding, and auto-generated encryption key via lookup.

Confidence Score: 3/5

The self-update happy path is solid, but a failed helm-runner Job silently blocks all retry attempts for up to 10 minutes with no operator warning.

The core upgrade loop, rejoin recovery, chart guardrails, and persistence defaults are well-reasoned and tested. The 409-on-failed-Job case in create_job quietly absorbs what should be a retryable error, and min_supported_version being pinned to the target version would block all upgrades once the agent enforces that field. The upgrade path is a new critical flow touching the agent self-replacement mechanism, so these gaps carry more weight than they would in a peripheral feature.

crates/alien-agent/src/loops/agent_upgrade.rs (failed-Job 409 handling) and crates/alien-manager/src/routes/sync.rs (build_agent_target min_supported_version) need the most attention before this ships to production.

Important Files Changed

Filename	Overview
crates/alien-agent/src/loops/agent_upgrade.rs	New Helm-runner Job spawner; 409 from a failed (but not yet TTL'd) Job is silently treated as idempotent success, stalling upgrades for up to 10 min with no operator warning.
crates/alien-manager/src/routes/sync.rs	New agent_sync inventory persistence and build_agent_target; min_supported_version is set equal to the target version, which would block all agents from upgrading once that field is enforced.
crates/alien-manager/src/routes/deployments.rs	New PUT /v1/deployments/:id/target-agent-version endpoint; auth, authz, and semver validation are present; is_plausible_semver rejects valid versions with both pre-release and build metadata.
crates/alien-helm/src/generator.rs	Chart guardrails (required fields, Recreate strategy, persistence default, upgrader SA/Role/ClusterRole), encryption key auto-generation, and /livez+/readyz probe defaults all look correct.
crates/alien-agent/src/cli.rs	Adds rejoin fall-through on 409 from initialize and sync-token persistence; logic and error routing look correct.
crates/alien-manager/src/stores/sqlite/deployment.rs	update_agent_metadata and set_target_agent_version are correctly scoped; neither updates the updated_at column, leaving deployment timestamps stale after inventory syncs.
crates/alien-core/src/sync.rs	SyncRequest/SyncResponse extended with new optional fields; AgentTarget, AgentHelmTarget, AgentBinaryTarget structs and back-compat tests look correct.
crates/alien-infra/src/core/executor.rs	Kubernetes ServiceAccounts are correctly skipped at controller-lookup time and auto-marked Running as platform-provided resources; logic mirrors external-binding treatment.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Manager
    participant K8sAPI
    participant HelmRunnerJob

    Note over Agent,Manager: Normal sync loop
    Agent->>Manager: "POST /v1/sync (agent_version, regime=kubernetes)"
    Manager->>Manager: update_agent_metadata()
    Manager->>Manager: build_agent_target() if target differs from reported
    Manager-->>Agent: SyncResponse with agent_target helm payload

    Note over Agent,K8sAPI: Self-update actuator
    Agent->>K8sAPI: POST /apis/batch/v1/.../jobs (helm-runner Job)
    K8sAPI-->>Agent: 201 Created (or 409 if exists)
    K8sAPI->>HelmRunnerJob: schedule pod
    HelmRunnerJob->>HelmRunnerJob: helm upgrade --reuse-values --atomic --wait

    Note over Agent,Manager: Pod rolls (Recreate strategy)
    HelmRunnerJob->>K8sAPI: Deployment updated with new image tag
    Agent->>Agent: /readyz returns 503 until first sync
    Agent->>Manager: POST /v1/sync with new agent_version
    Agent->>Agent: "first_sync_completed=true, /readyz returns 200"

    Note over Agent,Manager: Rejoin after state wipe
    Agent->>Manager: POST /v1/initialize returns 409
    Agent->>Manager: POST /v1/rejoin with name
    Manager->>Manager: get_deployment_by_name() and mint new token
    Manager-->>Agent: deployment_id and fresh token
    Agent->>Agent: db.set_sync_token() persists for next restart

Comments Outside Diff (1)

crates/alien-manager/src/stores/sqlite/deployment.rs, line 506-550 (link)

updated_at is not bumped during agent metadata writes

update_agent_metadata (and similarly set_target_agent_version) never touches the updated_at column. Any API consumer that uses updated_at to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting updated_at = datetime('now') inside the same UPDATE, or at least doing so for set_target_agent_version which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/alien-manager/src/stores/sqlite/deployment.rs
Line: 506-550

Comment:
**`updated_at` is not bumped during agent metadata writes**

`update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
crates/alien-agent/src/loops/agent_upgrade.rs:296-305
**Silently swallows a failed Job's 409 — upgrade stalls with no warning**

When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.

To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires.

### Issue 2 of 4
crates/alien-manager/src/routes/sync.rs:1107-1112
`min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade.

```suggestion
    Some(alien_core::sync::AgentTarget {
        version: target_version.to_string(),
        // Allow all existing agents to receive the upgrade instruction.
        // Set to a meaningful floor (e.g. the first version that supports
        // self-update) once the agent enforces this field.
        min_supported_version: "1.0.0".to_string(),
        binary: None,
        helm,
    })
```

### Issue 3 of 4
crates/alien-manager/src/routes/deployments.rs:876-892
`splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.

```suggestion
fn is_plausible_semver(v: &str) -> bool {
    // Strip optional +build metadata first so the pre-release check can
    // use a simple split on '-'.
    let without_build = v.splitn(2, '+').next().unwrap_or(v);
    let mut parts = without_build.splitn(2, '-');
    let core = parts.next().unwrap_or("");
    let core_ok = {
        let segs: Vec<&str> = core.split('.').collect();
        segs.len() == 3
            && segs
                .iter()
                .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
    };
    let rest_ok = parts.next().map_or(true, |r| {
        !r.is_empty()
            && r.chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
    });
    core_ok && rest_ok
}
```

### Issue 4 of 4
crates/alien-manager/src/stores/sqlite/deployment.rs:506-550
**`updated_at` is not bumped during agent metadata writes**

`update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.

_{Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile}

greptile-apps · 2026-06-03T13:08:03Z

+    }
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use serde_json::json;
+
+    fn inputs() -> JobInputs {


Silently swallows a failed Job's 409 — upgrade stalls with no warning

When the Helm-runner Job exists but is in a Failed state (exhausted its backoffLimit), the Kubernetes API still returns 409 on a new create. The code logs at info level and returns Ok(()), so apply_agent_target sees success. The manager keeps emitting agent_target on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to ttlSecondsAfterFinished seconds (10 minutes). During that window, there is no warn log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.

To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a warn when the existing Job is in a Failed condition, so operators can investigate before the TTL window expires.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/alien-agent/src/loops/agent_upgrade.rs Line: 296-305 Comment: **Silently swallows a failed Job's 409 — upgrade stalls with no warning** When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened. To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-03T13:08:03Z

+    Some(alien_core::sync::AgentTarget {
+        version: target_version.to_string(),
+        min_supported_version: target_version.to_string(),
+        binary: None,
+        helm,
+    })


min_supported_version is set equal to version, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the reported_version == Some(target_version) guard returns None first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like "1.0.0", or by explicitly recording the minimum agent version that is safe to self-upgrade.

Suggested change

Some(alien_core::sync::AgentTarget {

version: target_version.to_string(),

min_supported_version: target_version.to_string(),

binary: None,

helm,

})

Some(alien_core::sync::AgentTarget {

version: target_version.to_string(),

// Allow all existing agents to receive the upgrade instruction.

// Set to a meaningful floor (e.g. the first version that supports

// self-update) once the agent enforces this field.

min_supported_version: "1.0.0".to_string(),

binary: None,

helm,

})

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/alien-manager/src/routes/sync.rs Line: 1107-1112 Comment: `min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade. ```suggestion Some(alien_core::sync::AgentTarget { version: target_version.to_string(), // Allow all existing agents to receive the upgrade instruction. // Set to a meaningful floor (e.g. the first version that supports // self-update) once the agent enforces this field. min_supported_version: "1.0.0".to_string(), binary: None, helm, }) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-03T13:08:05Z

+fn is_plausible_semver(v: &str) -> bool {
+    let mut parts = v.splitn(2, |c| c == '-' || c == '+');
+    let core = parts.next().unwrap_or("");
+    let core_ok = {
+        let segs: Vec<&str> = core.split('.').collect();
+        segs.len() == 3
+            && segs
+                .iter()
+                .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
+    };
+    let rest_ok = parts.next().map_or(true, |r| {
+        !r.is_empty()
+            && r.chars()
+                .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
+    });
+    core_ok && rest_ok
+}


splitn(2, |c| c == '-' || c == '+') splits on the first occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. 1.4.0-rc.1+build.123) fail: the remainder becomes "rc.1+build.123" and the + isn't in the allowed character set [alphanumeric, '.', '-']. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.

Suggested change

fn is_plausible_semver(v: &str) -> bool {

let mut parts = v.splitn(2, |c| c == '-' || c == '+');

let core = parts.next().unwrap_or("");

let core_ok = {

let segs: Vec<&str> = core.split('.').collect();

segs.len() == 3

&& segs

.iter()

.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))

};

let rest_ok = parts.next().map_or(true, |r| {

!r.is_empty()

&& r.chars()

.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')

});

core_ok && rest_ok

}

fn is_plausible_semver(v: &str) -> bool {

// Strip optional +build metadata first so the pre-release check can

// use a simple split on '-'.

let without_build = v.splitn(2, '+').next().unwrap_or(v);

let mut parts = without_build.splitn(2, '-');

let core = parts.next().unwrap_or("");

let core_ok = {

let segs: Vec<&str> = core.split('.').collect();

segs.len() == 3

&& segs

.iter()

.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))

};

let rest_ok = parts.next().map_or(true, |r| {

!r.is_empty()

&& r.chars()

.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')

});

core_ok && rest_ok

}

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/alien-manager/src/routes/deployments.rs Line: 876-892 Comment: `splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin. ```suggestion fn is_plausible_semver(v: &str) -> bool { // Strip optional +build metadata first so the pre-release check can // use a simple split on '-'. let without_build = v.splitn(2, '+').next().unwrap_or(v); let mut parts = without_build.splitn(2, '-'); let core = parts.next().unwrap_or(""); let core_ok = { let segs: Vec<&str> = core.split('.').collect(); segs.len() == 3 && segs .iter() .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())) }; let rest_ok = parts.next().map_or(true, |r| { !r.is_empty() && r.chars() .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-') }); core_ok && rest_ok } ``` How can I resolve this? If you propose a fix, please make it concise.

alongubkin · 2026-06-04T10:05:46Z

+/// POST a Job to the in-pod Kubernetes API. Returns Ok on 2xx, AlienError
+/// otherwise. 409 (already exists) is treated as success — another sync
+/// already kicked the Job and idempotency handles the retry.
+async fn create_job(namespace: &str, body: &serde_json::Value) -> Result<()> {


these APIs should be in alien-k8s-clients

alongubkin · 2026-06-04T10:06:48Z

+    pub agent_version: Option<String>,
+    /// `linux` / `macos` / `windows`. From `std::env::consts::OS`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub agent_os: Option<String>,


I think we have an existing enum for this somewhere in alien-core

alongubkin · 2026-06-04T10:06:52Z

+    pub agent_os: Option<String>,
+    /// `x86_64` / `aarch64`. From `std::env::consts::ARCH`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub agent_arch: Option<String>,


I think we have an existing enum for this somewhere in alien-core

alongubkin · 2026-06-04T10:07:17Z

+    /// How the agent is supervised — `os-service` (launcher) or `kubernetes`
+    /// (Helm). Detected at runtime from `KUBERNETES_SERVICE_HOST`.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub regime: Option<AgentRegime>,


regime is a really weird name haha

alongubkin · 2026-06-04T10:10:19Z

+         exec helm upgrade \"$RELEASE\" \"$CHART_REF\"{version_flag}{extra_args_flag} \
+            --namespace \"$NAMESPACE\" \
+            --reuse-values \
+            --atomic --wait \


Can you document why did we choose --atomic?

alongubkin · 2026-06-04T10:13:33Z

+            --values /tmp/values.json"
+    );
+
+    let body = serde_json::json!({


What happens if creating this job fails? Or pulling the job's image fails? Or the command inside it fails? Do we report back those failures in any way?

How many retries do we support (I think the answer should be unlimited)? Is there exponential backoff or constant interval between retries?

alongubkin · 2026-06-04T10:13:51Z

+            "name": job_name,
+            "namespace": inputs.namespace,
+            "labels": {
+                "app.kubernetes.io/managed-by": "alien-agent",


We recently improved our white-labeling support, need to make sure it fits there. See other places in alien-helm

alongubkin · 2026-06-04T10:14:27Z

@@ -0,0 +1,453 @@
+//! Agent self-update actuator. When `/v1/sync` returns


Right now this will only work for Kubernetes deployments, right? No support for local service swap etc? What'll happen in local?

Yonatan Schkolnik added 7 commits June 2, 2026 15:43

test(helm): re-baseline data_layer + manager_only_pure_worker snapshots

d940ead

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

alongubkin requested changes Jun 4, 2026

View reviewed changes

		@@ -0,0 +1,453 @@
		//! Agent self-update actuator. When `/v1/sync` returns

Conversation

yonatan-alien commented Jun 3, 2026

Summary

Test plan

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading